Abstract


Applications like distributed computing need frequent and intensive exchange of
data over a communication network. Schemes like the Message Passing Interface (MPI)
provide communication libraries, among other facilities, to effect distributed computing
over a network. Most implementations of these libraries use the TCP/IP protocols for the
transport and network layer functions, while the libraries themselves reside in the
application layer. Since TCP/IP is designed to work reliably even in very large networks,
it is bound to be slow and inefficient for small, high performance, reliable networks. As
a result the transport layer becomes the bottleneck in computing speed: a considerable
amount of time is spent in communicating.

This thesis designs and implements a lightweight communication protocol,
LightCommunicator, to substitute for the heavyweight TCP/IP stack in distributed
computations over small, high speed LANs. The substitute offers the same reliability
characteristics as TCP/IP but has a processing delay of half that of TCP/IP.

In the design of LightCommunicator we assume that all communication is over the
same LAN.

Chapter 1


Introduction


To support parallel programming over a network of computers, several schemes have
been proposed. Most of these schemes provide a library of functions to facilitate
interprocessor communication, and add constructs for parallel programming to a standard
high level programming language. Examples of such schemes include the Message Passing
Interface (MPI), Parallel Virtual Machine (PVM), Chameleon, Chimp, Zipcode, etc. These
include basic functions to initiate a session of parallel execution of a program, to send
and receive messages, to close a session, etc. Apart from these basic functions, some
advanced functions are also provided, for example functions to create separate subgroups,
functions for process management and error handling, functions for collective messaging,
etc. These schemes are designed primarily for master slave architectures and support a
variety of interconnection topologies. Although the performance of the parallel computer
formed using these libraries is not as good as that of multiprocessor computers, they
offer a cheap and useful alternative.

At IIT Kanpur, MPI has been extensively used for distributed scientific computing.
It is generally believed by this user community that the effectiveness of a distributed
computing setup is not completely evident, and we believe that this could be because
MPI uses TCP/IP in its communication library. This seems reasonable because TCP/IP
is a general purpose communication protocol suite meant for bidirectional communication
even across hostile networks. To improve the communication efficiency of the MPI library,
especially when the computers are connected over a high performance LAN, we decided to
develop LightCommunicator, a lightweight communication protocol stack that interfaces
like TCP/IP but improves upon its delay performance, especially in distributed
computing environments using MPI.

The MPI standard is a complete interface for message based interprocess communication,
specified by a multilateral gathering of parallel computing users, vendors and
researchers. The MPI standard provides true portability to parallel programs. The
message passing paradigm of MPI imparts a distributed memory characteristic to the
multicomputer, which adds to the portability. Most MPI implementations are designed
to also work on heterogeneous systems. The MPI standard is flexible enough to run on a
network of computers as well as on standalone multiprocessor computers. MPI provides
convenient C and Fortran 77 bindings for its interface.

The MPI standard does not address implementation issues. To have a closer look
at an implementation we will consider LAM MPI (Local Area Multicomputer/Message
Passing Interface), developed at the Ohio Supercomputer Center. It is written in two
layers. The upper layer is portable and independent of the communication subsystem. It
interfaces to the lower layer through the Request Progression Interface (RPI), consisting
of eight functions. This interface uses Internet domain TCP sockets as the communication
subsystem, with TCP/IP as the underlying protocol.

Most MPI implementations, LAM MPI for example, provide a parallel programming
environment over a network and use TCP/IP as the transport and network layer
protocol for data transfer. This makes the implementations portable because TCP/IP
is a widely used communication protocol. Thus the MPI library functions lie entirely in
the application layer. The part of the implementation that deals with the communication
issues consists mainly of routines which interface the MPI implementation with the
TCP/IP stack of the kernel.

Since TCP/IP is designed for a wide variety of applications, it may lead to poor end
to end throughput in certain applications. In applications like distributed computing
with message passing, low communication delay is a necessity. This is because in such
systems memory is distributed and, to maintain coherence, frequent message passing
is necessary. Also, computations which need parallel programming typically process
large amounts of data, and these need to be exchanged during the computation. Thus
a distributed computing environment involves intensive and frequent message
passing. In such situations small inefficiencies in communication can lead to large overall
delays. In addition to the delays due to the nature of the communication protocol, delays
can also be increased by implementation inefficiencies. For example, the number of
copies of a message that are made while it traverses the stack depends on the
implementation and not on the protocol characteristics.

In this thesis we present the design and implementation of a lightweight communication
protocol called LightCommunicator for distributed computing applications over
friendly LANs. LightCommunicator provides the applications with reliability similar
to that of TCP/IP, without incurring the overheads of TCP/IP, for intensive, frequent
and unidirectional data transfer over a local area network. LightCommunicator
is implemented inside the Linux kernel. The final implementation fits in
the kernel like the other standard protocols already implemented in the Linux kernel. Thus
the protocol can coexist in the kernel with other communication protocols.
LightCommunicator, as seen by the application layer, is exactly similar to TCP/IP but
differs from it in implementation. This feature minimises the changes required in the
source code of the application layer programs to replace TCP/IP by LightCommunicator.

The next chapter discusses various overheads of TCP/IP and why they are unnecessary
in distributed computing environments. Chapter 3 discusses the implementation of
the network module in the Linux kernel. It also discusses the mechanism with which the
requests for data transfer are processed in the Linux kernel. Chapter 4 explains the
details of LightCommunicator and its implementation. The last chapter presents some
experimental results that study the delay performance of LightCommunicator vis-a-vis
TCP/IP. It also includes a summary of the work and suggests some future developments.

Chapter 2


The TCP/IP Protocol Suite

2.1 Introduction

In the early 1980s the Advanced Research Projects Agency (ARPA) of the US Department
of Defense (DoD) specified a new family of protocols as the standard for the ARPANET
sponsored by it. This protocol suite, accurately known as the "DARPA Internet protocol
suite", is widely known as the TCP/IP protocol suite or simply TCP/IP (Transmission
Control Protocol/Internet Protocol).

TCP/IP has many interesting features. It is not vendor specific and can be implemented
on lowly personal computers as well as on large supercomputers. It can be used
for both LANs and WANs. Today TCP/IP links millions of nodes around the globe using
all possible physical media like telephone lines, coaxial cables (Ethernet links), fiber
optic cables and satellite channels. One of the reasons for the popularity of TCP/IP was
its very early inclusion in the BSD Unix systems. Subsequently its inclusion in other
operating systems like Unix System V added to its popularity.

The general nature of TCP/IP is implied by the extensive service it provides
to a variety of networks supporting a wide spectrum of applications like HTTP,
FTP, SMTP, etc. This capability for extensive service makes the TCP/IP protocol suite
too sophisticated for small, reliable, high speed networks, because most of its features
become redundant and eventually become the bottleneck in achieving high end-to-end
throughput.

2.2 Organisation of TCP/IP

The basic skeleton of TCP/IP is a simplified OSI stack containing only four layers, viz.
the application layer, the transport layer, the network layer and the data link layer [3]
(see Figure 2.1).



Figure 2.1 Simplified 4 layer model used by TCP/IP

There are many protocols in the TCP/IP suite, and they comprise a family referred to as
the TCP/IP family. This family is a collection of many members organised as shown in
Figure 2.2. Transmission Control Protocol (TCP) is a connection oriented protocol that
provides a reliable full duplex byte stream for a user process. It is the most widely used
protocol of this family. User Datagram Protocol (UDP) is a connectionless protocol for
user processes. It is unreliable, i.e., there is no guarantee that a datagram sent will reach
the destination. Internet Control Message Protocol (ICMP) is used to handle error and
control information between gateways and hosts. This protocol is not meant to serve user
processes but carries messages generated by the TCP/IP networking software. Internet
Group Management Protocol (IGMP) is the part of TCP/IP dealing with the broadcast
and multicast functions of the protocol.

Internet Protocol (IP) is the network layer serving the transport layer occupied by
TCP, UDP, ICMP and IGMP. It does not carry the messages produced by user processes
directly. Nodes in a TCP/IP network or Internet have an address called the IP address.
This is used to uniquely identify an Internet connection. This address is different from the
hardware (data link layer) address used by a node in the network to which it is connected.
Address Resolution Protocol (ARP) is a protocol to serve the queries of IP about the
hardware addresses to be assigned to an outgoing packet. Reverse Address Resolution
Protocol (RARP) is a protocol used to map hardware addresses to IP addresses.



Figure 2.2 TCP/IP protocol family
The above layout follows the four layer model shown in Figure 2.1.


2.3 The Network Layer - IP

The IP layer provides an unreliable delivery of the data streams supplied by the transport
layer. The IP layer does not keep any account of the packets sent by it. An outgoing
packet is completely independent of the preceding and the succeeding packets. No
sequencing is done in this layer. Reliability of data delivery is provided by the upper
layers.

For addressing, IP uses a 32 bit address, which is unique for each network connection
and is organised in four categories, viz. Class A, B, C and D addresses. The IP address
has some bits dedicated to denote the class to which it belongs, some to the network and
the rest to the host. Such an organisation is necessary because of the size of the
internetwork and the variety of networks served by the protocol. An IP packet contains a
20 byte IP header (plus some option bytes, if they exist) followed by the data to be
transmitted.
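
The class of an address can be read off its leading bits. The text does not list the bit patterns, so the sketch below assumes the conventional classful prefixes (0 for Class A, 10 for B, 110 for C, 1110 for D); the function name `ip_class` is only illustrative.

```python
def ip_class(address):
    """Return the classful category (A-D) of a dotted-quad IPv4 address."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("not a valid IPv4 address: %r" % address)
    first = octets[0]
    if first & 0b10000000 == 0:            # leading bit 0
        return "A"
    if first & 0b11000000 == 0b10000000:   # leading bits 10
        return "B"
    if first & 0b11100000 == 0b11000000:   # leading bits 110
        return "C"
    if first & 0b11110000 == 0b11100000:   # leading bits 1110
        return "D"
    return "reserved"                      # leading bits 1111, outside the four classes


print(ip_class("172.16.0.1"))   # B
```

On a single LAN all hosts share the same class and network bits, so only the host part distinguishes them.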

The basic function of the IP layer is to translate the IP address into the
hardware address to which the packet has to be sent. Hardware addresses are not
globally unique and, in the more general case where different kinds of networks (e.g.,
Ethernet, Token Ring, etc.) are involved, there might not be any relation between two
hardware addresses. This mapping may also change with time. ARP and RARP provide the
mapping on demand for IP. ARP and RARP are full fledged protocols in themselves. (This
mapping needs to be done at every router that a packet passes through on its way to the
destination.) In a LAN environment the ARP and RARP processing is not required, because
the nodes have unique hardware addresses and a host can be uniquely identified by this
address.

Another important function of IP is fragmentation of outgoing packets. This is
necessitated by the fact that the data link layer might not allow arbitrarily long packets
to be transmitted on the network. The packet size is limited by the Maximum Transmission
Unit (MTU) of the data link layer. IP is also equipped to handle situations where a
packet moves from a network with a higher MTU to a network that supports a lower
MTU. This capability of IP is also of no use if the packet does not cross a bridge.
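
A minimal sketch of fragmentation under an MTU limit follows. It assumes the 20 byte header without options mentioned earlier and the IPv4 convention that fragment offsets are multiples of 8 bytes; the function name and the tuple layout are hypothetical, not taken from any implementation.

```python
IP_HEADER_LEN = 20  # bytes, header without options


def fragment(payload, mtu):
    """Split payload into (offset, more_fragments, data) pieces that fit the MTU."""
    # Data carried per fragment, rounded down to a multiple of 8 bytes
    max_data = (mtu - IP_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for offset in range(0, len(payload), max_data):
        chunk = payload[offset:offset + max_data]
        more = offset + len(chunk) < len(payload)   # False only for the last piece
        fragments.append((offset, more, chunk))
    return fragments


pieces = fragment(bytes(4000), 1500)
print([(off, more, len(data)) for off, more, data in pieces])
# [(0, True, 1480), (1480, True, 1480), (2960, False, 1040)]
```

When every packet stays on one LAN, the sender can simply size its packets to the single known MTU and this step never has anything to do.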

There are many more functions of IP, like forwarding, transparent proxying,
masquerading, etc., which are of no use when a packet moves within a LAN.

2.4 The Transport Layer - TCP and UDP

TCP is more complicated than IP. The strength of the family is provided by this protocol.
It performs a wide range of functions. It is highly reliable and has powerful mechanisms
to control the flow of data.

TCP is a connection oriented protocol, i.e., two applications establish a connection
before using TCP for data transfer. Establishment of a connection is a three way process.
Unless a connection is established, two terminals cannot communicate with each other.
After the data transfer is complete, they have to terminate the connection. A more
detailed discussion follows in § 2.4.1 and § 2.4.2.

To ensure a reliable flow of data, TCP breaks the application data into small segments
before delivering it to IP. A TCP segment contains a TCP header of 20 bytes (and some
option bytes, if any) followed by data. On the receiver side, TCP sends acknowledgments
for the data it receives. The process of sending acknowledgments is slightly involved
and will be discussed in § 2.4.4. A TCP transmitter can send segments whose sequence
numbers lie within a window starting from the last segment number for which the
acknowledgment was received. After sending a segment it starts a timer, and if it does
not receive an acknowledgment before the expiry of the timer, it retransmits the
segment. These retransmissions can cause duplicate instances of the same segment at the
destination, which are to be discarded by TCP. If segments arrive out of order they have
to be rearranged. To ensure in order and reliable delivery of data, TCP also maintains a
checksum of the whole segment.
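
The receiver side duties described above (discarding duplicates and rearranging out of order segments) can be sketched as follows. As a simplifying assumption, sequence numbers here count whole segments rather than bytes, and the function name is illustrative.

```python
def deliver_in_order(arrivals, first_seq=0):
    """Return payloads in sequence order, discarding duplicate segments.

    arrivals: iterable of (sequence_number, payload) in arrival order.
    """
    expected = first_seq
    pending = {}        # out-of-order segments waiting for a gap to be filled
    delivered = []
    for seq, payload in arrivals:
        if seq < expected or seq in pending:
            continue    # duplicate caused by a retransmission: discard
        pending[seq] = payload
        while expected in pending:          # hand over every segment now in order
            delivered.append(pending.pop(expected))
            expected += 1
    return delivered


print(deliver_in_order([(0, "a"), (2, "c"), (1, "b"), (1, "b"), (3, "d")]))
# ['a', 'b', 'c', 'd']
```

On a lightly loaded LAN segments rarely arrive out of order, so the `pending` store stays empty almost all the time.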

TCP is also responsible for controlling congestion in the network. It does this by
dynamically changing the window size and the timer timeouts. Several techniques like
delayed acknowledgments, Nagle's rule, piggybacking, etc. are used by TCP to improve
end to end throughput.

A TCP segment is handed over to the IP layer, which adds its own header and gives it to
the data link layer. The data link layer adds its header and trailer and finally transmits it
on the network. Figure 2.3 shows the encapsulation of data as it goes down the protocol
stack.
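
The encapsulation can be illustrated by counting the bytes each layer adds. The 20 byte TCP and IP headers are the option-free sizes given in this chapter; the 14 byte link header and 4 byte trailer are an assumption corresponding to Ethernet framing, and the zero-filled headers are placeholders only.

```python
TCP_HEADER = 20    # bytes, without options
IP_HEADER = 20     # bytes, without options
LINK_HEADER = 14   # assumed: Ethernet header
LINK_TRAILER = 4   # assumed: Ethernet frame check sequence


def encapsulate(app_data):
    """Wrap application data in placeholder TCP, IP and link layer framing."""
    segment = bytes(TCP_HEADER) + app_data                        # transport layer
    packet = bytes(IP_HEADER) + segment                           # network layer
    frame = bytes(LINK_HEADER) + packet + bytes(LINK_TRAILER)     # data link layer
    return frame


def overhead_bytes(app_data):
    """Bytes on the wire that are not application data."""
    return len(encapsulate(app_data)) - len(app_data)


print(len(encapsulate(b"x" * 100)), overhead_bytes(b"x" * 100))   # 158 58
```

The fixed cost per frame matters most for short messages, where the headers can outweigh the data they carry.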

2.4.1 Connection Establishment

TCP uses a three way handshake in its connection establishment procedure. First, a
request for a connection is made from the client side by sending a SYN signal. Then the
server acknowledges the request by sending a SYN and an ACK, showing its readiness to
establish a connection. Finally the client acknowledges by sending an acknowledgment.
If this process is not complete within the time frame determined by the timeouts of both
the ends, the establishment of the connection fails.

Figure 2.3 Encapsulation of data as it goes down the protocol stack

To handle nasty situations a three way handshake is essential. It is obvious that
two way messaging is compulsory. The third message ensures that the client's timer for
reception of the acknowledgment for its request has not expired and that it is still
interested in connecting. In the relatively congenial situation of a LAN such a complicated
procedure is not necessary because of low delays and high data reliability. If the hosts are
up, the requests and acknowledgments arrive in time and the complicated connection
establishment procedure of TCP/IP is a waste of time. In reliable networks the
connection establishment procedure can be made similar to the initiation of a conversation
on the telephone. The caller dials the phone number and the bell rings on the other side,
indicating the request to make a connection. The person who is being called picks up the
phone and says "hello", which is similar to sending an acknowledgment. Now the caller
can straightaway start the conversation if there is no chance of a wrong number. Thus
in a LAN, where the probability of delay is negligible, it is quite evident that a two way
handshake is sufficient to provide enough reliability in connection establishment. Even if
something goes wrong during the connection establishment, timers may be used to take
care of the situation, as explained in § 2.4.6.

2.4.2 Termination of Connection

Whereas connection establishment uses a three way handshake, connection termination
uses a four way handshake. According to the algorithm, the end that initiates the request
sends a FIN signal showing its wish to close the connection. The other end acknowledges
the reception of the FIN. After the initiator sends the FIN it cannot send any more data,
but the other end can continue to transfer data over the link. Finally the other end also
sends a FIN to close the connection. After the initiator acknowledges this request the
connection is closed.

A four way handshake is essential to close a TCP connection because it is a full
duplex connection. A TCP connection (which is full duplex) can be considered as a
combination of two back to back half duplex connections, and its termination is equivalent
to terminating two half duplex connections. This is similar to the termination of a telephone
conversation. Neither of the two parties hangs up the phone before both have finished.
Before disconnecting, the caller and the called will ensure that they have nothing more to
say by asking if the other has something else to say.

In situations where the transaction is unilateral, a two way termination is also
sufficient. In a unilateral connection the sender will open a connection only if it has some
data to transmit. When it finishes it will close the connection. To ensure a proper
termination, the receiver may acknowledge the termination. A unilateral transaction is
similar to writing a letter. In that case the letter terminates without the consent of the
reader and is completely at the wish of the writer.
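
The difference between four way and two way termination can be sketched by treating each data-carrying direction as a half duplex connection that is closed by a FIN and its acknowledgment. The endpoint names and the function below are purely illustrative.

```python
def termination_messages(directions):
    """List the messages needed to close the given half duplex directions.

    directions: list of (sender, receiver) pairs, one per direction that
    carries data. Each direction is closed by a FIN and its acknowledgment.
    """
    messages = []
    for sender, receiver in directions:
        messages.append((sender, "FIN"))
        messages.append((receiver, "ACK"))
    return messages


full_duplex = termination_messages([("A", "B"), ("B", "A")])
unilateral = termination_messages([("A", "B")])
print(len(full_duplex), len(unilateral))   # 4 2
```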

2.4.3 Transmission and Reception of TCP Segments

The transport layer makes transport layer frames and hands them to the IP layer. The IP
layer adds its own header and the hardware header and hands the result over to the device
layer, i.e., the device driver of the operating system. The resulting packet that is finally
transmitted on the physical medium is shown in Figure 2.3. The frame given to the
IP layer consists of the transport layer header (the TCP header, which includes the
checksum for the segment) and the application layer data. The IP layer checks the routing
table and finds the next hop device for the packet. It then fragments the data to satisfy
the size requirements of the network of the device (i.e., the size of every packet
transmitted on the network should be less than the MTU of the network) and fills the IP
header using the information provided by the transport layer and the information generated
by itself (i.e., from the routing table). After filling the IP header it adds the hardware
header, which it obtains from the routing table entry. If the required hardware header is
not available in the routing table, IP uses ARP to obtain the information.

The complete packet is handed over to the device driver (the device trailer is added by
the device itself and not by the IP layer). The packet as composed by the IP layer is now
the responsibility of the device layer, which finally transmits it on the physical medium.
An identical model of layers exists on the receiver side.

2.4.4 Acknowledgments

To ensure delivery of segments, the TCP client (receiver) acknowledges reception.
Various techniques are used to improve the throughput, but these actually deteriorate the
throughput performance in a lightly loaded LAN, especially when the data flow is
voluminous and unidirectional.

To use bandwidth efficiently, TCP piggybacks acknowledgments on data which may
be transmitted in the reverse direction in case of bidirectional data flow. Very often data
might not be available for piggybacking; the acknowledgment routine then waits till it gets
data or till a timer specifically set for it expires. This is known as delayed acknowledgment.
This is advantageous when the data transfer is interactive, as for example in telnet, ftp,
etc. However, when the data flow is effectively unidirectional, an acknowledgment is sent
only when the delayed acknowledgment timer expires, adding unnecessary delays.
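
The cost of delayed acknowledgment in a unidirectional flow can be seen in a small sketch. The function and parameter names are hypothetical and the timer value is an arbitrary illustration, not a value taken from any TCP implementation.

```python
def ack_send_time(ack_ready, next_data_time, delay_timer):
    """Return (time the ACK leaves, whether it was piggybacked).

    ack_ready:      time at which the acknowledgment becomes due
    next_data_time: time at which reverse-direction data is ready, or None
    delay_timer:    duration of the delayed acknowledgment timer
    """
    deadline = ack_ready + delay_timer
    if next_data_time is not None and next_data_time <= deadline:
        return next_data_time, True    # piggybacked on outgoing data
    return deadline, False             # standalone ACK sent when the timer expires


print(ack_send_time(0.0, 0.05, 0.2))   # (0.05, True): interactive, bidirectional flow
print(ack_send_time(0.0, None, 0.2))   # (0.2, False): unidirectional flow waits the full timer
```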

Nagle's rule is another approach to increase efficiency. Under this rule, small frames
are not sent and the software waits for more data to be sent by the application layer.
After enough bytes are accumulated, or a timer set for the purpose expires, a packet is
sent.
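
The accumulation behaviour described above can be sketched as follows. This follows the description in the text (send when enough bytes are buffered or when the timer expires); the final flush stands in for the timer expiry, and the function name is illustrative.

```python
def coalesce_writes(writes, segment_size):
    """Group small application writes into full segments.

    A segment is emitted as soon as segment_size bytes have accumulated.
    Whatever is left at the end is flushed, modelling the timer expiry.
    """
    segments = []
    buffer = b""
    for data in writes:
        buffer += data
        while len(buffer) >= segment_size:
            segments.append(buffer[:segment_size])
            buffer = buffer[segment_size:]
    if buffer:
        segments.append(buffer)    # timer expired: send the partial segment
    return segments


print(coalesce_writes([b"ab", b"c", b"defg", b"h"], 4))   # [b'abcd', b'efgh']
```

Five small writes leave as two segments; the price is that the first bytes wait in the buffer until the segment fills.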

To avoid unnecessary load created by the acknowledgments, intermittent acknowledging
may be done, i.e., all packets received need not be explicitly acknowledged.
Acknowledgment in TCP is such that whenever a packet is acknowledged, all the packets
that precede it are implicitly acknowledged. Thus explicit acknowledgments for
successively received packets may be eliminated by acknowledging the last packet in the set.
This is also called cumulative acknowledgment.
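
A cumulative acknowledgment can be computed as the last packet up to which nothing is missing. The sketch below numbers whole packets for simplicity (real TCP acknowledges byte positions), and the function name is illustrative.

```python
def cumulative_ack(received, first_seq=0):
    """Return the highest packet number such that it and all earlier packets
    have arrived, or None if the very first packet is still missing."""
    have = set(received)
    seq = first_seq
    while seq in have:
        seq += 1
    return seq - 1 if seq > first_seq else None


print(cumulative_ack([0, 1, 2, 4]))   # 2: packet 3 is missing, so 4 is not covered
```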

2.4.5 Flow Control

TCP is a protocol designed to work on a network as large as the global Internet and hence
has strong flow control capabilities. It controls the flow of data by dynamically changing
the window size and the maximum segment size.

TCP uses sliding window flow control with a dynamic window size. The algorithm
prohibits the transport layer from transmitting a segment beyond a window whose starting
point is defined by the last segment for which the acknowledgment has been received.
The size of the window depends on how quickly a packet is acknowledged, and this is tracked
using an estimate of the round trip time. In case there is congestion, the estimate
of the round trip time becomes large and the window size is decreased so that less data is
transmitted into the network; this is expected to decrease congestion. The round
trip time estimate is also useful in setting the retransmit timers, because if the expiry
time of the retransmit timer is less than the time taken by the acknowledgment to reach
the sender, there will be continuous retransmission.
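
The text does not give the estimation formula. A common choice, assumed here, is the exponentially weighted average of the original TCP specification, with the retransmit timeout set to a multiple of the smoothed estimate; the constants `alpha` and `beta` are typical values, not ones taken from this thesis.

```python
class RttEstimator:
    """Smoothed round trip time estimate and the retransmit timeout derived from it."""

    def __init__(self, initial_rtt, alpha=0.875, beta=2.0):
        self.srtt = initial_rtt   # smoothed round trip time
        self.alpha = alpha        # weight given to the previous estimate
        self.beta = beta          # safety factor applied for the timeout

    def update(self, sample):
        """Fold a newly measured round trip time into the estimate."""
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * sample
        return self.srtt

    def timeout(self):
        """Retransmit timeout: kept above the estimated round trip time."""
        return self.beta * self.srtt


est = RttEstimator(initial_rtt=1.0)
est.update(9.0)                    # one slow acknowledgment during congestion
print(est.srtt, est.timeout())     # 2.0 4.0
```

Every acknowledgment that triggers `update` costs a few arithmetic operations, which is the continuous overhead discussed next.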

The estimates of the round trip time have to be made continuously and are effectively
overheads in LANs, because here the acknowledgment is received fairly quickly and it is
rare that the transport layer has transmitted one window of segments and is waiting for
the window to slide. However, TCP will always make these estimates.

Similarly, the segment size can also be used for flow control. Larger segments increase
the load on the network, and small segments increase the number of acknowledgments.
Many TCP implementations do not send acknowledgments for every segment they receive,
and use cumulative acknowledgments instead. Although this decreases congestion and
improves the end to end throughput, it increases the processing overhead at the receiver.
For example, Linux uses per octet acknowledgments, which improves the throughput but
increases the processing overhead.

2.4.6 Timers


To cope with situations where, in the middle of a session, one of the hosts stops
responding, TCP has some timers. Apart from these timers there are timers to implement the
techniques used to enhance the throughput of the link. In the following we discuss the
various timers used by TCP.

The TCP transmitter maintains a retransmission timer which is started when a
packet is sent. If the acknowledgment is not received before this timer expires, it
retransmits a random packet within the transmit window. The expiry time of the
retransmission timer is not fixed and is varied according to the estimate of the round trip
time; a pessimistic value is chosen in the beginning.

There is a partial queue timer which is set when a request to send a partially filled
segment is submitted. The packet, instead of being transmitted instantaneously, is queued.
If enough data becomes available to make a full segment before the expiry of this timer,
the full segment is transmitted. Otherwise the partial queue is transmitted.

A delayed acknowledgment timer is maintained to see if an acknowledgment can be
piggybacked on a data packet. For this, if data is not immediately available for
transmission, TCP will wait for some time. Thus the delayed acknowledgment timer is set
whenever an acknowledgment is to be transmitted. If data is available before this timer
expires, the acknowledgment is piggybacked on it; otherwise the acknowledgment is sent
on the expiry of the timer.

There is a keep alive timer to see that the connection is not idle for a long time. If
a sender does not send any data for a long time, the receiver uses this timer to close
the connection. This is useful when a client goes to sleep while it is being served by a
remote server.

In real implementations more timers are used than those mentioned above. For
example, Linux uses two more timers: one at the IP layer to ensure continuous reception
of IP fragments, and one general purpose timer.

These timers are served by a separate process belonging to the scheduler software.
This section of the code maintains a list of all the active timers and checks the expiry
of all the timers at each clock tick. When a timer expires, it calls a function specified
when the timer was initialised, and the timer is deleted from the list. The longer the list,
the greater the load on the scheduler and hence on the processor. This will in turn affect
the performance of other processes. In a massive unidirectional transfer over a reliable
LAN, the partial queue timer and the delayed acknowledgment timer are useless. The
keep alive timer is also of no use. Thus in such cases two timers, a transmit timer
at the transmitter side and a receive timer at the receiver side, are sufficient. This will be
discussed in more detail in § 4.4.5 and § 4.5.3.
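
The timer mechanism described above, a list scanned at every clock tick, can be sketched as follows. The class and method names are hypothetical and do not correspond to the kernel's actual data structures.

```python
class TimerList:
    """List of active timers, scanned once per clock tick."""

    def __init__(self):
        self.now = 0
        self.active = []    # entries: (expiry_tick, callback)

    def add(self, delay_ticks, callback):
        """Register a callback to be called delay_ticks from now."""
        self.active.append((self.now + delay_ticks, callback))

    def tick(self):
        """Advance the clock, fire expired timers and drop them from the list."""
        self.now += 1
        expired = [t for t in self.active if t[0] <= self.now]
        self.active = [t for t in self.active if t[0] > self.now]
        for _, callback in expired:
            callback()
        return len(self.active)    # the scan cost per tick grows with this length


fired = []
timers = TimerList()
timers.add(2, lambda: fired.append("retransmit"))
timers.add(1, lambda: fired.append("delayed_ack"))
timers.tick()
timers.tick()
print(fired)   # ['delayed_ack', 'retransmit']
```

Since every entry is examined on every tick, dropping the timers that are useless on a reliable LAN directly shortens the scan.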

2.4.7 TCP State Diagram

The above discussion shows the complexity of the TCP/IP protocol, which is inevitable
because the protocol is designed to work reliably on a variety of large and small networks.
The extent of the complexity is evident from the TCP state diagram shown in Figure 2.4.
It is a state diagram with eleven states and twenty transitions. It is interesting to observe
that during a session over a congenial network the set of states visited will be a small
subset of those shown in Figure 2.4. If a state machine were to be designed for such
networks, it would be considerably simpler than the one in Figure 2.4. Some states can be
removed altogether. Some transitions among the remaining states can also be assumed
never to occur, and if they do occur a strategy of closing the connection may be followed.
Thus TCP itself can be reduced to a lightweight transport protocol which will work very
reliably on small LANs.

As will be discussed in Chapter 4, a reduced set of states and transitions will be
enough to ensure reliability of data delivery in congenial situations like that in LANs.
This reduction will improve the performance of the transaction and will ensure an efficient
use of the available bandwidth.


2.5 An Analysis of TCP/IP Implementation Overhead

Many of the overheads in TCP/IP are caused by the particular way in which it is
implemented. The implementation is bound to be more complex than the protocol. This is
because, to implement details of the protocol, some overheads may be added that reduce
the efficiency further.

Figure 2.4 TCP state diagram

A study of the Berkeley implementation of TCP, as included in UNIX, is available in
[1]. In this paper Clark et al. count the number of instructions for the normal flow
path in the TCP state machine. The Berkeley implementation uses a buffering scheme in
which data is stored in a series of chained buffers called mbufs. This buffering scheme
is obviously a characteristic of the implementation rather than of TCP as a protocol.

2.5.1 Input Processing

There are three stages in the TCP input processing. In the first, a search is made to find
the local state information (called the Transmission Control Block or TCB) for this TCP
connection. In the second, the TCP checksum is verified. This requires computing a
simple function of all the bytes in the packet. In the third stage the packet header is
processed.

The calculation of the checksum depends on the raw speed of the environment and the
detailed coding of the computation. The calculation is done once the whole application
layer packet is received.
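
The "simple function of all the bytes" is the Internet checksum: the 16 bit one's complement of the one's complement sum of the data. A minimal sketch follows; it omits the pseudo-header that the real TCP checksum also covers, so it illustrates the arithmetic only.

```python
def internet_checksum(data):
    """16-bit one's complement of the one's complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


packet = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
print(hex(internet_checksum(packet)))        # 0x220d
```

The receiver repeats the sum over the data together with the transmitted checksum; a result of zero means no error was detected. Because every byte is touched, the cost grows linearly with the packet size.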

Searching for the TCB can be speeded up by maintaining a TCB cache. This leads to
a very light algorithm on average. A study showed that on a workstation in general
use (opening 5,710 connections over 38 days and receiving 353,238 packets) the single
entry cache matched the incoming packet 93.2% of the time. For a mail server, which might
be expected to have a much more diverse set of connections, the measured ratio was 89.8%
([...] and 121,676 incoming packets).
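
A single entry TCB cache can be sketched as a remembered "last connection" in front of the full lookup table. The class below is illustrative only; the connection identifiers and TCB values are placeholders.

```python
class TcbTable:
    """Connection table with a single entry cache in front of it."""

    def __init__(self):
        self.table = {}          # full lookup structure: connection id -> TCB
        self.cached_key = None
        self.cached_tcb = None
        self.hits = 0
        self.lookups = 0

    def add(self, key, tcb):
        self.table[key] = tcb

    def lookup(self, key):
        """Find the TCB for an incoming packet, trying the cache first."""
        self.lookups += 1
        if key == self.cached_key:
            self.hits += 1
            return self.cached_tcb
        tcb = self.table[key]                    # slow path: full search
        self.cached_key, self.cached_tcb = key, tcb
        return tcb


tcbs = TcbTable()
tcbs.add(("10.0.0.1", 5000), "TCB-A")
tcbs.add(("10.0.0.2", 5001), "TCB-B")
for key in [("10.0.0.1", 5000)] * 3 + [("10.0.0.2", 5001), ("10.0.0.1", 5000)]:
    tcbs.lookup(key)
print(tcbs.hits, tcbs.lookups)   # 2 5
```

In a bulk transfer, long runs of packets belong to the same connection, which is why such a small cache matches so often.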

The packet input processing code has rather different paths for the sender and receiver
of data. The overall numbers are the following:

• Sender of data: 191 to 213 instructions

• Receiver of data: 186 instructions

Both sides contain a common path of 154 instructions. Of these, 15 are either
procedure entry and exit or initialisation. For the receiver of the data, an additional 15
instructions are spent sequencing the data and calling the buffer manager, and another
17 are spent processing the window field in the packet.

The sender of the data, which is receiving control information, has more steps to
perform. In addition to the 154 common instructions, it takes 9 to process the
acknowledgment, 20 to process the window, 17 to compute the outgoing congestion window
(so called "slow start" control), and 44 instructions (but not for each packet) to estimate
the round trip time. The round trip delay is measured not for every packet but only once
per round trip. For short delay paths, where one packet can be sent in one round trip, this
cost could occur for every acknowledgment. Since the Berkeley TCP acknowledges at most
every other packet in a bulk data transfer, the cost in this case is 22 instructions per
packet. For longer paths the cost will be spread over more packets, so 22 instructions is
an upper bound.
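
Two of the figures quoted above can be checked with simple arithmetic: the receiver total, and the amortised cost of the round trip estimate when an acknowledgment covers at least two packets. The variable names are ours; the numbers are those quoted from [1].

```python
COMMON_PATH = 154    # instructions shared by sender and receiver

# Receiver of data: common path + sequencing/buffer manager + window field
receiver_total = COMMON_PATH + 15 + 17

# The round trip estimate costs 44 instructions per acknowledgment at most,
# and an acknowledgment covers at least two data packets in a bulk transfer.
RTT_ESTIMATE = 44
PACKETS_PER_ACK = 2
rtt_cost_per_packet = RTT_ESTIMATE / PACKETS_PER_ACK

# The spread of the quoted sender range matches this amortised cost.
sender_spread = 213 - 191

print(receiver_total)        # 186
print(rtt_cost_per_packet)   # 22.0
print(sender_spread)         # 22
```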

2.5.2 Output Processing

The output analysis is somewhat less detailed than that of the input side. To send a TCP
packet, 235 instructions were used. This number provides a rough measure of the output
cost, but it is dangerous to compare it closely with the input processing cost. In fact TCP
puts most of its complexity in the sending end of the connection. This complexity is not
a part of packet sending but a part of receiving the control information about
that data in an incoming acknowledgment packet.

2.5.3 Cost of IP

In the normal case IP performs very few functions. On input of a packet, it checks
the header for correct form, extracts the protocol number and calls the TCP processing
function. The executed path is almost always the same. On output, the operation
is even simpler.

The instruction counts for IP were as follows:

• Packet receipt: 57 instructions

• Packet sending: 61 instructions

It is not indicated in [1] whether the above are the number of instructions in the program
or the number of instructions actually executed during the running of the respective
programs. We believe it is the former, because the end-to-end delays experienced during a
data transaction correspond to the execution time of a much larger number of instructions
than those indicated above.

The implementation of TCP/IP on Linux is very similar to the implementation
of TCP/IP in BSD. It also has data arranged in a chain of buffers called sk_buffs. Most
of the implementation overheads that exist for BSD also exist in the case of Linux.

Chapter 3


Implementation of Network Module 
in Linux 

3.1 Introduction

Like all other Unix-like systems, Linux provides network access to the application
programs through a group of system calls called the socket system calls. These system calls,
like the file access system calls, provide basic facilities like creating an interface,
making and closing connections, sending and receiving streams of bytes, and various control
functions. Each network protocol family separately implements these system calls and
advertises its implementation using a process of registration with the socket interface of
the kernel. This process of registration makes the socket interface aware of the existence
of the protocol family in the kernel, so that it can pass on the requests from the
application programs to the protocol family. Simultaneously, the protocol has to make the
device layer of Linux aware of its presence in the kernel, so that the bottom handler
can hand over the packets meant for it that arrive over the device. The bottom
handler is that part of the kernel which processes the incoming packets from the device
and distributes them to the various protocols that are registered with it. The protocol
family to which an incoming packet belongs is obtained from the hardware header in
the incoming packet. For an Ethernet packet, this information is contained in the 16-bit
protocol field of the Ethernet header. The process of registration and referencing of
function calls is in the "top half" of the Linux network module and will be discussed in
§ 3.2. The bottom handler and the registration of the protocols with the bottom half of
the network module will be discussed in § 3.3.

3.2 The Top Half

The network module of the Linux operating system resides in the net directory of the
Linux source code. The socket interface is provided by socket.c, which serves the requests
from the application layer and passes them on to the protocol specific routines, as will be
explained in § 3.2.3. The structure of the network module code is shown in Figure 3.1. The
protocol, after acting upon the data given by the application layer, passes it to the device
handlers resident in the directory net/core (more specifically in dev.c). The routine in
dev.c finally calls the device driver to transmit the packet.



Figure 3.1 Structure of the directory net

On the receiver side the same path is followed in the opposite direction. The device driver
calls the routines of dev.c and delivers the data to them. The bottom handler residing there
distributes the data to the respective protocols. The protocols, after processing the input
data stream, hand over the data to socket.c and the data is finally passed to the
application layer.

3.2.1 Registration Procedure

The protocols register themselves through a structure proto_ops, which contains an integer
identifier used by socket.c to identify that particular protocol and a set of pointers
to functions which are implementations of the system calls. The registration is done by
a function sock_register, defined in socket.c, which takes a variable of type proto_ops
and adds it to a global array maintained by socket.c. Whenever a socket system call is
made for a protocol family, socket.c first finds the entry corresponding to that protocol
and directs the control to the corresponding function using the pointer fields of the entry
(which is of type proto_ops).

There is an initialising routine for each protocol which registers it by calling
sock_register. The initialising routines corresponding to each protocol are available in an
array maintained by protocol.c. These routines are called by sock_init, which initialises
the entire socket interface. sock_init is called when the kernel boots up by the function
start_kernel (in init/main.c).
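The registration mechanism can be modelled in user space as follows. This is a sketch under stated assumptions, not the kernel source: the reduced proto_ops layout, the table size NPROTO, the lookup helper sock_find and the demo protocol family AF_DEMO are all hypothetical, and the function signatures are simplified for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NPROTO 16   /* assumed capacity of the global table */

/* Reduced stand-in for proto_ops: an identifier plus function pointers. */
struct proto_ops {
    int family;                                   /* integer identifier */
    int (*create)(int type);                      /* simplified signature */
    int (*sendmsg)(const void *buf, size_t len);  /* simplified signature */
};

static const struct proto_ops *pops[NPROTO];      /* the global array */

/* Find the entry registered for a family, or NULL if there is none. */
const struct proto_ops *sock_find(int family)
{
    for (int i = 0; i < NPROTO; i++)
        if (pops[i] != NULL && pops[i]->family == family)
            return pops[i];
    return NULL;
}

/* Add ops to the first free slot; -1 if already registered or table full. */
int sock_register(const struct proto_ops *ops)
{
    assert(ops != NULL);
    if (sock_find(ops->family) != NULL)
        return -1;
    for (int i = 0; i < NPROTO; i++) {
        if (pops[i] == NULL) {
            pops[i] = ops;
            return 0;
        }
    }
    return -1;
}

/* A hypothetical protocol family and its initialising routine. */
#define AF_DEMO 5
static int demo_create(int type) { return type; }
static int demo_sendmsg(const void *buf, size_t len)
{
    (void)buf;
    return (int)len;            /* pretend every byte was sent */
}
const struct proto_ops demo_ops = { AF_DEMO, demo_create, demo_sendmsg };

/* Would be called once at start-up, in the way sock_init calls each
 * protocol's initialising routine. */
void demo_proto_init(void)
{
    (void)sock_register(&demo_ops);
}
```

After demo_proto_init has run, sock_find(AF_DEMO) returns the table of function pointers through which every later call for that family is routed.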

3.2.2 Interface to the Application Layer

The kernel provides access to the network through a variable of type struct socket. It
is known to the application layer through a unique integer for each instance of a socket,
called the socket file descriptor (see Figure 3.1).

Whenever a request to create a socket of a certain network address family is made
using the system call socket, the control goes to socket.c. socket.c then looks up that
particular family in its database. If it finds one, it creates a variable of type struct
socket, assigns a file descriptor for it and records this assignment for future use. Type
struct socket has a field ops of type proto_ops, to which socket.c assigns the variable
of type proto_ops known to it at boot up time. Thus, after this assignment, socket->ops
points to the structure containing all the implementations of the system calls relevant
to the protocol. This reference is used by the socket interface when the system calls are
made.

Finally, after all these assignments are made, the create function of the protocol is
called using socket->ops, i.e. socket->ops->create is called. The control then reaches
the protocol routine. There, relevant initialisation is done, the details of which will be
discussed in § 3.5.

3.2.3 Calling System Calls

After a socket is created, pointers to all the functions are stored in the ops field of
the socket. When any system call is made, the application layer specifies the socket by
passing the socket file descriptor returned by the socket system call.

When a call to send a message is made, the control first reaches socket.c. socket.c
then finds the socket corresponding to the socket file descriptor passed to it. After
getting the socket, socket.c calls the corresponding function from the pointers available
in socket->ops, e.g. in this case socket->ops->sendmsg will be called with parameters
which are standard for the interface between socket.c and the corresponding protocol
implementation. Thus the job of socket.c is to direct requests from the application layer
to particular protocols. This redirection is invisible to the user, and the user program
has to specify the protocol only while calling the socket function to create an interface
or a socket. After creation of a socket, all the functions have a similar structure when
viewed from the application layer.
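The path from a socket file descriptor to the protocol routine can be sketched as below. The code is a user-space model under assumptions: the descriptor table, the functions sys_socket_demo and sys_send_demo, the bytes_sent field and the demo protocol are hypothetical, and the real interface passes more parameters than shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FD 8    /* assumed size of the descriptor table */

struct socket;      /* forward declaration for the function pointers */

/* Reduced proto_ops: only the two calls used in this sketch. */
struct proto_ops {
    int (*create)(struct socket *sock);
    int (*sendmsg)(struct socket *sock, const void *buf, size_t len);
};

struct socket {
    const struct proto_ops *ops;  /* filled in when the socket is created */
    size_t bytes_sent;            /* hypothetical per-socket state */
};

static struct socket fd_table[MAX_FD];
static int fd_used[MAX_FD];

/* Model of socket(): allocate a struct socket, attach the family's ops,
 * call ops->create and return the descriptor (or -1 on failure). */
int sys_socket_demo(const struct proto_ops *family_ops)
{
    assert(family_ops != NULL);
    for (int fd = 0; fd < MAX_FD; fd++) {
        if (!fd_used[fd]) {
            fd_table[fd].ops = family_ops;
            fd_table[fd].bytes_sent = 0;
            if (family_ops->create(&fd_table[fd]) < 0)
                return -1;
            fd_used[fd] = 1;
            return fd;
        }
    }
    return -1;
}

/* Model of send(): map the descriptor to its socket, then dispatch
 * through socket->ops without knowing which protocol is behind it. */
int sys_send_demo(int fd, const void *buf, size_t len)
{
    if (fd < 0 || fd >= MAX_FD || !fd_used[fd])
        return -1;
    struct socket *sock = &fd_table[fd];
    return sock->ops->sendmsg(sock, buf, len);
}

/* A hypothetical protocol implementation. */
static int demo_create(struct socket *sock) { (void)sock; return 0; }
static int demo_sendmsg(struct socket *sock, const void *buf, size_t len)
{
    (void)buf;
    sock->bytes_sent += len;
    return (int)len;
}
const struct proto_ops demo_ops = { demo_create, demo_sendmsg };
```

The protocol is named only when the socket is created; sys_send_demo never inspects it, which mirrors the redirection described above.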

3.3 The Bottom Half

The bottom half deals with the incoming packets from the device driver. The arriving
packets are delivered to the respective protocols. The transaction between the device
driver and the bottom half of the network module is made through a structure struct
sk_buff. This sk_buff is given to the protocol to which the packet belongs. The destination
protocol for an arriving packet is identified from the information provided
by the protocols at the time of boot up. This is done by the initialising function described
in § 3.2.1.

3.3.1 Registration with the Bottom Handler


Each protocol, at the time of its initialisation, registers itself with the bottom half
of the network module. This is done by calling the function dev_add_pack. This function
takes a structure of type packet_type as a parameter. This parameter has a field for a
16-bit identifier for itself (in Ethernet, the identifier is the same as the protocol field of
the Ethernet header) and a field for a pointer to a function which has to be called when
a packet of the type specified by this type field arrives from the network. This function
will process the packet for the protocol. When the function dev_add_pack is called, it
adds the information in the above parameter to a list maintained by it for future use.
Whenever a packet arrives, this information is used to call the handler of the protocol to
which the packet belongs.

3.3.2 Delivery of Packet to the Protocol Handler

All the incoming packets are handled by the function net_bh, which is interrupt driven.
The transaction between the device driver and the upper layer is through a data structure
struct sk_buff, as shown in Figure 3.1. When a packet arrives at the device driver, it
allocates an sk_buff, fills the necessary fields (e.g. size, protocol, etc.) and calls a
function netif_rx defined in dev.c. netif_rx initialises net_bh to be the bottom half
handler and marks the interrupt bits. Thus net_bh is called for every incoming packet.

As explained in § 3.3.1, net_bh has access to a list which contains information about
the protocol handlers and their type. When a packet arrives, it compares the protocol
field of the sk_buff with its database. If it can find the protocol in its database, it calls
the corresponding handler and passes the incoming sk_buff to it. The sk_buff is now
the responsibility of the protocol.
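The registration of § 3.3.1 and the delivery step described here can be modelled in user space as follows. This is a sketch under assumptions rather than the kernel code: the sk_buff is reduced to its protocol field, the handler signature is simplified, byte order is ignored, and net_bh_deliver, MAX_PTYPES and the two demo handlers are hypothetical names. The sketch also honours the catch-all type ETH_P_ALL, whose handlers are called for every packet.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_P_ALL  0x0003   /* catch-all type: matches every packet */
#define ETH_P_IP   0x0800   /* Ethernet protocol field value for IP */
#define MAX_PTYPES 8        /* assumed capacity of the handler list */

/* sk_buff reduced to the one field this sketch needs. */
struct sk_buff { unsigned short protocol; };

struct packet_type {
    unsigned short type;                /* 16-bit identifier */
    int (*func)(struct sk_buff *skb);   /* handler, simplified signature */
};

static struct packet_type *ptypes[MAX_PTYPES];
static int nptypes;

/* Record a protocol's handler for later use by the delivery loop. */
void dev_add_pack(struct packet_type *pt)
{
    assert(pt != NULL && nptypes < MAX_PTYPES);
    ptypes[nptypes++] = pt;
}

/* Deliver one packet: call every handler whose type equals the packet's
 * protocol field, plus every ETH_P_ALL handler. Returns the number of
 * handlers that were called. */
int net_bh_deliver(struct sk_buff *skb)
{
    int delivered = 0;
    for (int i = 0; i < nptypes; i++) {
        if (ptypes[i]->type == ETH_P_ALL ||
            ptypes[i]->type == skb->protocol) {
            ptypes[i]->func(skb);
            delivered++;
        }
    }
    return delivered;
}

/* Two hypothetical handlers that only count their invocations. */
static int ip_count, tap_count;
static int ip_rcv_demo(struct sk_buff *skb)  { (void)skb; return ++ip_count; }
static int tap_rcv_demo(struct sk_buff *skb) { (void)skb; return ++tap_count; }
struct packet_type ip_pt  = { ETH_P_IP,  ip_rcv_demo };
struct packet_type tap_pt = { ETH_P_ALL, tap_rcv_demo };
```

With both handlers registered, an IP packet reaches the IP handler and the tap, while a packet of any other type reaches only the tap.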

At this level, i.e. the bottom half of the network module, there is a possible
tap to extract the incoming packets. If in the list of protocols registered with the
bottom handler there is a protocol of type 3 (defined in the kernel code as ETH_P_ALL),
the handler of that protocol will be called every time a packet arrives, irrespective of the
type field in the hardware header. This facility is included only to provide a tapping
point for all the incoming packets. There can be more than one registration with type
ETH_P_ALL.

3.3.3 Delivery of a Packet to the Device Driver


The previous section dealt with the mechanism by which a packet that has arrived at
the device is given to the respective protocol. The current section deals with the process
by which a packet is given by the upper layer to the device driver.

As explained in § 3.2, the device handling routines reside in the directory net/core.
The sending of packets is done by the function dev_queue_xmit. This function performs
some checks and tries to transmit the buffer by calling the device driver explicitly if there
is no other packet waiting to be transmitted. If there is another packet to be transmitted,
this packet is queued. If this packet is being retransmitted, it is placed at the head of
the transmit queue. If the device driver fails to send the packet, it is again put in the










3.4 Network Buffer Management


All the buffers used by the networking layers are sk_buffs. The control for these is
provided by core low level library routines available to all the networking routines.
sk_buff provides general buffering and flow control facilities needed by the network
protocols [2].

The network devices are accessed through a variable of type struct device. This
device structure contains all the generic information and methods for the network device.

3.4.1 The sk_buff Structure

The primary goal of the sk_buff routines is to provide a consistent and efficient buffer
handling mechanism to the network module. To be consistent, the higher level sk_buff
and socket handling facilities need to be followed by all the protocols in the network
module. sk_buffs can be queued, forming a doubly linked list for sequential processing
by the device driver and the protocol. An sk_buff is a structure with a block of memory
attached to it.

There are primarily two types of routines defined in the sk_buff library. The first
category deals with the manipulation of the list of sk_buffs and the second deals with
the manipulation of the memory block associated with it.



Figure 3.3 Flow of sk_buff while sending

The first type of routines, those for manipulation of sk_buff lists, are used in both the
receiving and sending of data. Whenever data is to be sent, the protocol layer allocates an
sk_buff, fills it and queues it into a list of sk_buffs maintained in the device structure.
Each device has three queues which hold packets of different priorities (low delay, high
throughput and high reliability). The sk_buffs are taken from the head of the device
queue and transmitted by calling the device driver. This is shown in Figure 3.3. On the
receiving side, every incoming packet is placed into the receive queue of the sock for
the corresponding connection to which the packet (or the sk_buff) is destined. This is
done by the protocol handler as explained in § 3.5. This sk_buff is taken out from the
receive queue of the sock when a receive() is called from the application, as shown in
Figure 3.4.


[Figure 3.4 (label: User Process)]



The second type of routines are those that deal with the manipulation of the memory
block attached to the sk_buff. Every sk_buff has three parts, viz. the head room,
the data area and the tail room. These are defined by pointers held in the sk_buff
structure. On allocation of an sk_buff of a certain size, the entire memory block is
occupied by the tail room. As the various headers and the data are filled in, the data area
grows and the others contract. This is primarily done by the routines provided in the buffer
library.
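The three parts of the memory block can be illustrated with four pointers, as in the user-space sketch below. It is an assumption-laden model, not the kernel structure: struct skb and skb_alloc, skb_len and skb_free are names invented for the example, while skb_reserve, skb_put, skb_push, skb_headroom and skb_tailroom are named after the kernel routines they imitate and are reimplemented here in simplified form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified buffer: head <= data <= tail <= end at all times. */
struct skb {
    unsigned char *head;  /* start of the memory block */
    unsigned char *data;  /* start of the data area */
    unsigned char *tail;  /* end of the data area */
    unsigned char *end;   /* end of the memory block */
};

/* Allocate a buffer; initially the whole block is tail room. */
struct skb *skb_alloc(size_t size)
{
    struct skb *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->head = malloc(size);
    if (s->head == NULL) {
        free(s);
        return NULL;
    }
    s->data = s->tail = s->head;
    s->end = s->head + size;
    return s;
}

size_t skb_headroom(const struct skb *s) { return (size_t)(s->data - s->head); }
size_t skb_len(const struct skb *s)      { return (size_t)(s->tail - s->data); }
size_t skb_tailroom(const struct skb *s) { return (size_t)(s->end - s->tail); }

/* Set aside head room in an empty buffer for headers added later. */
void skb_reserve(struct skb *s, size_t len)
{
    assert(s->data == s->tail && len <= skb_tailroom(s));
    s->data += len;
    s->tail += len;
}

/* Grow the data area at its end; returns the start of the new region. */
unsigned char *skb_put(struct skb *s, size_t len)
{
    assert(len <= skb_tailroom(s));
    unsigned char *p = s->tail;
    s->tail += len;
    return p;
}

/* Grow the data area at its front, e.g. to prepend a header. */
unsigned char *skb_push(struct skb *s, size_t len)
{
    assert(len <= skb_headroom(s));
    s->data -= len;
    return s->data;
}

void skb_free(struct skb *s)
{
    free(s->head);
    free(s);
}
```

A sender would typically reserve room for all headers, put the payload, and then push each header in turn, so that no data has to be copied when a lower layer adds its header.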

3.4.2 The device Structure

All Linux network devices follow the same interface, although many of the functions
available in that interface will not be needed for all the devices. An object oriented
design is used and each device is an object with a series of methods that are filled into
a structure (the device structure). Each method is called with the device itself as the
first argument.

The generic information and the methods for each network device are kept in the
device structure. To create a device, most of the fields for the methods need to be
initialised. Each device is identified by a string pointer name, like eth0, eth1, tr0. After
filling the methods and variables, the device initialisation routine advertises the device
by calling the function register_netdev().

The device structure contains a block of parameters used to maintain the location
of a device within the device address space of the architecture, e.g. fields to hold the irq
number, dma channel, base_address, mem_start, mem_end, etc. A group of parameters
is provided to be used by net-tools, which is a package consisting of tools like ifconfig
which are used to set and get various device parameters like MTU, protocol address,
various flags, capabilities, etc. There are a few parameters that are used by the protocol
layer, e.g. hard_header_len, pa_addr (the protocol address), etc.

The methods attached to the device consist of the basic routines to set up the device,
fill the hardware header (dev->hard_header()), transmit a frame (dev->hard_start_xmit()),
etc. These functions are used by the lowest level of the protocol layer to send and receive
frames. Together these methods comprise the device driver.

3.5 The Protocol Layer

The protocol layer lies below socket.c and above the low level device access routines
defined in dev.c, as shown in Figure 3.1. A protocol consists of a set of methods defined
in accordance with the rules laid out by that protocol. These methods are contained in
the proto_ops structure filled by the protocol initialising routine as explained in § 3.2.1
and are called when a system call is made as explained in § 3.2.3.


3.5.1 The sock Structure

The system call, as defined in socket.c, calls these methods with socket as a parameter.
Within the body of the protocol, the data structure used is the sock structure, as shown
in Figure 3.1. When the socket is created (or opened) a sock is allocated and the pointer
to it is stored in the data field of the socket. The sock structure contains information
relevant to the protocol, e.g. window size, socket MTU, timers, sequence number of the
packet sent, sequence number of the last packet for which an acknowledgment has been
received, etc.

When the socket is opened, the system call eventually calls the create() of the
protocol to which the socket belongs. The create() routine assigns a sock for the
socket as explained earlier and initialises the various fields. The sequence numbers are
initialised to ensure a proper start, e.g. TCP initialises the sequence number of the last
packet sent as a random number following RFC 793 and the sequence number of the last
packet for which an acknowledgment has been received as one less. The window size is fixed
as two (slow start). The MTU is set to 576. All the timers are initialised.

Apart from these data fields, the sock has a prot field which has methods to implement
various transport layer protocols on the same network layer protocol. When
the sock is initialised, the relevant variable is filled in the prot field of the sock. For
example, the TCP callbacks are filled in tcp_prot, and when a TCP socket is opened
socket->sock->prot is assigned to tcp_prot. Now, when a send is called on a TCP
socket, socket->sock->prot->sendmsg() is called.

The prot structure has an array of sock, sock_array, which contains all the actively
opened connections. Thus tcp_prot->sock_array has the sock to which an incoming
packet has to be delivered. An entry corresponding to a socket is made in this array when
bind() is called on that particular socket.

The sock structure contains three lists of sk_buffs: the write queue, the receive
queue and the backlog queue. The write queue consists of those sk_buffs which are
not being sent because of the limitation imposed by the window size. Whenever an
acknowledgment comes, an sk_buff is sent from this queue to the device. The receive
queue holds the packets that have arrived from the device. These sk_buffs are added by
the protocol handler. sk_buffs are taken out from this queue and the data is copied into
the application area. The backlog queue holds those packets which have arrived but
cannot be put in the receive queue.

3.5.2 Network Layer Implementation

The network layer provides primitive support for sending and receiving datagrams. This
support is unreliable and is used by the transport layer, which takes care of the reliability
issues. This section gives an overview of the implementation of the network layer, with
special reference to the implementation of the IP layer. (Note that the network module of
Linux is designed to suit the requirements of TCP/IP.)

The transport layer, after filling its header, calls the routine to fill the network layer
header. After these headers are filled, control passes on to the network layer to put
the packet on the device. The network layer does a translation from protocol address
to hardware address. To do so it uses its own methods, e.g. IP maintains a routing
table which it looks up when a packet has to be sent. To speed up this process,
a field is there in the sock structure to hold a routing table entry which is used as a
cache. Before looking up the routing table, it first compares the destination address with
the routing information stored in this cache. If it is relevant, it is used; otherwise the
routing table is checked and the entry just read is stored in the cache. Using this
information, the network layer fills the hardware header. It allocates an sk_buff, fills the
hardware header, the network layer header and the data (which contains the transport
layer header) into the sk_buff. Finally it sends the packet to the bottom half.

If the data block provided by the transport layer is larger than the device MTU (the
device to which the packet has to be sent is found from the routing table), it has to
be fragmented into small pieces of size less than or equal to the device MTU. Thus the
network layer breaks the data block and sends the pieces successively on the device. On the
receiving side, before delivering the packet to the transport layer, these fragments are
reassembled in order.
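The arithmetic of this splitting can be sketched as follows. The functions are hypothetical helpers written for illustration; they ignore header sizes and the alignment rules that IP itself imposes on fragment offsets, and treat the MTU simply as the largest number of data bytes that fits in one piece.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pieces needed to carry len bytes, at most mtu bytes each. */
size_t frag_count(size_t len, size_t mtu)
{
    assert(mtu > 0);
    return (len + mtu - 1) / mtu;   /* ceiling of len / mtu */
}

/* Offset of piece i within the original data block. */
size_t frag_offset(size_t i, size_t mtu)
{
    return i * mtu;
}

/* Length of piece i: mtu bytes, except possibly a shorter last piece. */
size_t frag_length(size_t i, size_t len, size_t mtu)
{
    size_t off = frag_offset(i, mtu);
    assert(off < len);
    size_t rest = len - off;
    return rest < mtu ? rest : mtu;
}
```

For a 4000 byte block and an MTU of 1500 bytes this gives three pieces of 1500, 1500 and 1000 bytes. The receiver can place each piece at its offset and hand the block upward once all offsets are filled.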

After reassembly of the fragments, the network layer calls the transport layer protocol
handler based on some information provided by the network layer header, e.g. the IP
header has a field for the protocol (TCP, UDP, ICMP, etc.). The frame now becomes the
responsibility of the transport layer.



3.5.3 Transport Layer Implementation

The transport layer is built completely over the network layer. It adds extra reliability
and controls the flow of data. The transport layer divides the application layer data into
segments and fills up its header in these segments. If the size of the window allows the
transmission of a segment, it is delivered to the network layer; otherwise it is added
to the write queue as explained in § 3.5.1. These segments are tagged by a sequence
number which is used to ensure the delivery of the segment. If reception of a segment is
not acknowledged for a certain amount of time, retransmission is done. To measure this
time there are timers associated with each socket, on whose expiry retransmission is
triggered. For retransmission each transport layer protocol has its own algorithm (TCP uses
the Go-back-n strategy).

On the reception side the network layer gives the assembled segment to the transport
layer. The transport layer finds the segment sequence and correspondingly sends or
withholds the acknowledgment. Using the sequence number information it arranges the
segments in order and throws away multiple copies of the same segment.

Once an ack is received, segments are released from the write queue as explained in
§ 3.5.1, according to the algorithm followed by the protocol.

The transport layer controls the amount of the flow of data by controlling various
parameters associated with the connection, e.g. window size, etc.

These implementations complete the whole map of a protocol from the device layer
to the transport layer. The object oriented approach of Linux makes all the protocols
look the same from the application layer. The internal implementations of the protocols also
follow the object oriented approach and thus lead to some kind of similarity in the basic
structure. Moreover, the existence of a library of generic routines to handle common aspects
like buffers, sockets, timers, queues, etc. provides a modular and layered structure to the
implementation.

Chapter 4


The Lightweight Communication 
Protocol 


From our discussions in the preceding chapters, two things are evident. Firstly, the
general purpose TCP/IP protocol will cause a considerable overhead when used for
distributed computing applications, especially in LAN environments. Secondly, Linux
provides enough flexibility and a clean modular mechanism to implement another
communication protocol that can be used by any application. The normal implementation
of Linux has TCP/IP and Unix protocols in its kernel. In this chapter we discuss the
design and implementation of a lightweight communication protocol specifically for use
by distributed applications in Linux based systems on LANs. We call this protocol
LightCommunicator. LightCommunicator is not intended as a general replacement for the
TCP/IP stack, rather as a replacement in the specific environment of distributed
computations over LANs.

In the next section we discuss the distributed computing environment. In § 4.2 we
discuss the desirable characteristics of a communication protocol for the target distributed
computation environment. In § 4.3 we present the design of the LightCommunicator
stack. The layers in this stack are explained in § 4.4 and § 4.5.

4.1 The Distributed Computation Environment


In the following we characterise the distributed computation environment and identify
the features that make substantial simplifications to the communication protocol stack
possible. We assume that the distributed systems are interconnected through an
off-the-shelf multiple access LAN like, for example, a 100 Mbps Ethernet. We first describe the
characteristics of this LAN in terms of its topology, throughput and error performance.
We then characterise the nature of data transaction in the kinds of distributed
computations for which this protocol has been designed.

4.1.1 The Underlying Network

In the design of LightCommunicator we make the basic assumption that all the nodes
are on the same network, i.e. the messages do not cross routers. This is a reasonable
assumption because, typically, routers introduce considerable delays in the communication
path and will deteriorate the performance of distributed computations.

In a LAN environment, typically, we can expect a friendly behaviour of the network
towards the packets that are being floated. This is because of many reasons. Firstly,
the chances of the network introducing errors are very low and can be neglected, and the
checksum in the MAC layer packet trailer is sufficient. This saves us from introducing an
extra error detection/correction overhead. This is because the Ethernet error detection
mechanism is known to be quite reliable. Secondly, the packets are always received in order.
This is because there is only one path between the sender and the receiver. This feature
helps us to detect a packet loss: if a packet arrives before its predecessor arrives, the
preceding packet is lost.

The round trip delay from the sender to the receiver is very low and we can expect
a prompt response from the peer protocol at the destination. This allows us to avoid
the unnecessary delays of waiting for long when acknowledgments are not received at the
sender, by having low timeout values for the retransmission timer. This is reasonable
because the end-to-end delay is only the time taken to transmit on the network and the
device driver processing at the sender and the receiver. For example, if a packet is not
followed by another one within a small interval, we can assume that there is something
wrong and the corresponding corrective action can be initiated.


4.1.2 The Transactions

In a distributed computation environment, typically there is intensive data transaction
and each transaction is predominantly unidirectional. This means that the overhead
introduced to increase bidirectional throughput, by piggybacking acknowledgments on
data or waiting for a segment to be filled when a partial frame is ready at the sender, can be
avoided. Also, most distributed computing is done using message passing libraries which
usually follow a master-slave architecture. In this architecture a typical transaction will
comprise the opening of a connection, a unidirectional transfer of data and then the closing
of the connection. Since communication is not interactive as, for example, in telnet, the
data flow is predominantly unidirectional and only one side is sending at a time. Such a
connection should ideally be controlled by the sender only.

In the environment that we are considering, we can assume that if a host is not
responding promptly, it is down. This is reasonable because a packet cannot be blocked
at any place in the network, since there are no routers between the sender and the
receiver to store the packets.

Any kind of flow control is unnecessary because there is no router between the sender
and the receiver. Also, the Ethernet collision resolution mechanism is an effective flow
control scheme. This means that a lot of the flow control mechanism overhead that TCP
has to live with can be avoided.


4.2 Defining the Characteristics of a LightCommunicator

The lightweight protocol described here is meant to be a replacement for the TCP/IP
stack in distributed computation applications. Thus the biggest requirement is that the
TCP/IP interface be mimicked by the LightCommunicator stack, i.e. LightCommunicator
must simulate the interface provided by TCP/IP to application programs, both in
terms of the functions called and the interpretation of the parameters passed to/from
these functions. In this respect a considerable amount of the compatibility is ensured by
the implementation of the socket interface. What is required is that the implementation
of the methods stated in § 3.2.3 should be externally identical to that of TCP/IP, and
they should take the same arguments in the same format as the corresponding methods
in the TCP/IP implementation. The following example illustrates this requirement. When
TCP/IP is used for communication, the application programs address the destination
using IP addresses. Therefore the protocol must accept IP addresses and port numbers.
If any other addressing mechanism is used, the conversion should be internal to the
protocol code. Similarly, on the receiving side it should report to the application layer
with IP addresses and port numbers instead of its own addressing scheme. For the same
reason, observe that implementation of any facility to the application layer other than
that provided by TCP is redundant, because that facility is not going to be used.

Also, note that there are many utilities which have no meaning when TCP is replaced
by some other transport layer protocol and IP is replaced by some other network layer
protocol. For example, the setsockopt()/getsockopt() functions have options to set and
get the window size. But if the transport layer protocol does not use window flow
control, this option has no meaning. Similarly, routines to control segment size also lose
their significance, and some of the networking layer routines are also rendered meaningless.
However, those should not be left unimplemented, but should be implemented in
a relevant way. For example, the function to get the window size should return some
sensible value so as to not confuse the application program.

4.3 The Stack

LightCommunicator uses stop and wait with selective negative acknowledgments to provide
reliability. The application data is broken into fragments and the fragments into
packets. Packets corresponding to a fragment are sent back to back without waiting
for acknowledgments. After a full fragment has been sent, the receiver sends requests
to retransmit the packets which have been lost. If all packets are received correctly, an
empty request is sent. On receipt of an empty request, the sender proceeds to the
next fragment. Timers are used to tackle unusual situations.
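One round of this scheme, for a single fragment, can be simulated as follows. The sketch is illustrative only and makes several assumptions that are not in the text: a fragment of PKTS_PER_FRAG packets, a loss pattern in which a packet can be dropped only on its first transmission, a retransmission request represented as an array of missing packet numbers, and no timers. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PKTS_PER_FRAG 8   /* assumed number of packets in one fragment */

/* Receiver side: which packets of the current fragment have arrived. */
struct frag_rx { bool got[PKTS_PER_FRAG]; };

void rx_accept(struct frag_rx *rx, size_t seq)
{
    assert(seq < PKTS_PER_FRAG);
    rx->got[seq] = true;
}

/* Build the retransmission request: the numbers of all missing packets.
 * A return value of 0 is the "empty request" that ends the fragment. */
size_t rx_build_request(const struct frag_rx *rx, size_t req[PKTS_PER_FRAG])
{
    size_t n = 0;
    for (size_t i = 0; i < PKTS_PER_FRAG; i++)
        if (!rx->got[i])
            req[n++] = i;
    return n;
}

/* Sender side: send the fragment back to back, then serve requests
 * until an empty one arrives. drop_first[i] marks packets lost on their
 * first transmission. Returns the total number of packets transmitted. */
size_t send_fragment(struct frag_rx *rx, const bool drop_first[PKTS_PER_FRAG])
{
    size_t sent = 0;
    size_t req[PKTS_PER_FRAG];
    size_t n;

    for (size_t i = 0; i < PKTS_PER_FRAG; i++) {   /* first pass */
        sent++;
        if (!drop_first[i])
            rx_accept(rx, i);
    }
    while ((n = rx_build_request(rx, req)) > 0) {  /* serve the requests */
        for (size_t k = 0; k < n; k++) {
            sent++;
            rx_accept(rx, req[k]);  /* retransmissions assumed to arrive */
        }
    }
    return sent;                    /* empty request: fragment complete */
}
```

With two of the eight packets lost on the first pass, the sender transmits ten packets in total and exchanges two requests with the receiver: one naming the two lost packets and one empty.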

The organisation of the LightCommunicator lightweight communication protocol follows
that of TCP/IP. The network layer is replaced by a Packetisation layer and the
transport layer by a Fragmentation layer. The functions of these will be described later.
The device layer used by TCP/IP is also used by LightCommunicator and its
implementation in the Linux kernel can be used. (The device layer includes the device drivers
and the routines to access device drivers.) Figure 4.1 shows the complete stack and the
operations performed in each layer.

APPLICATION LAYER

Fragmentation (Transport) layer
- establishment and termination of connection
- break application data into fragments of up to 32 KB
- reassemble fragments on reception
- serve retransmission requests
- maintain a reliable pipe of data flow

Packetisation (Network) layer
- map network address to hardware address
- break fragments and form packets
- reassemble packets to form segments
- send retransmission requests

Data Link Layer
- accessed through device layer

Figure 4.1 The stack used by LightCommunicator

The advantage of using the same stack as that of TCP/IP is that the implementation
of the network module in Linux supports such a structure. Moreover, splitting
of application data in two levels helps in maintaining a continuous flow of data, and an
efficient management of acknowledgments and retransmission of packets and fragments
is possible.

A close look at the LightCommunicator stack will show that the stack does violate
the peer to peer structure recommended for layered communication architectures. For
example, the retransmission requests, or nacks, are sent by the Packetisation layer of the
node issuing the request (the data receiver) and are served by the Fragmentation layer of
the node receiving it (the data sender). This is done to improve the throughput, as will be
explained later.

The various modules of the protocol are organised as shown in Figure 4.2. There is
only one module constituting the Packetisation layer, but the Fragmentation layer is made
up of three modules: a fragmentation module, which fragments the application data into
32 KB fragments and transmits them; a connection module, which deals with establishment and
termination of connections; and an acknowledgment management module, which serves
acknowledgments. The last is separate from the data transmission part of the fragmentation
module, but proper synchronisation is maintained between these two modules.



Figure 4.2: Organisation of various modules constituting the protocol

4.4 The Packetisation Layer


As shown in Figure 4.1, the Packetisation layer deals with translation from network
addresses to hardware addresses. In fact, this layer does two levels of translation. Since the
Fragmentation layer gets the IP addresses from the application, it supplies the
Packetisation layer with the IP addresses. The Packetisation layer translates this IP address
to the LightCommunicator address and fills the LightCommunicator header. Finally, it
gets the hardware header from a routing table maintained by it and fills the hardware
header. The hardware addresses cannot be used by LightCommunicator because this
information is not handed down by the device layer of Linux when a packet is received.
The application will need to know the source of the received data. Thus, to preserve
source information, LightCommunicator has its own addressing mechanism.

Figure 4.3: The 16 byte LightCommunicator header


4.4.1 The Header

LightCommunicator uses a 16 byte header which incorporates information corresponding
to both the network layer and the transport layer of the OSI stack. The header structure
is shown in Figure 4.3. The two node addresses are the protocol addresses of the two nodes
involved in the communication. The two port numbers are the source and destination
port numbers as used by TCP. The offset field gives the position of the particular packet
from the start of the transport layer fragment. There is a one bit MP (more packets)
indication in the offset field to signify that more packets follow the packet in the current
fragment. This bit is set in all packets belonging to the transport layer fragment except
the last packet. The length field specifies the length of the whole packet (including the
header). The sequence number will be discussed in § 4.5. The header length gives the
length of the header. This is not necessary for the present implementation, but it will be
needed once the IP options have to be included.
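As an illustration of how such a header might be packed, the following sketch encodes the listed fields into 16 bytes. The field widths, their order, the position of the MP bit and the presence of a one byte protocol field are all assumptions made for this example; the text names the fields but the exact layout is given only in Figure 4.3.

```python
import struct

# Assumed layout, network byte order: two node addresses, two ports, offset word,
# length and sequence number (2 bytes each), header length and protocol (1 byte each).
HEADER = struct.Struct("!HHHHHHHBB")
MP_BIT = 0x8000        # assumed: top bit of the 16 bit offset word
OFFSET_MASK = 0x7FFF   # remaining 15 bits carry the offset

def pack_header(src_node, dst_node, src_port, dst_port, offset, more_packets,
                length, seq, protocol, header_len=16):
    """Build a 16 byte header with the MP flag folded into the offset word."""
    offset_word = (offset & OFFSET_MASK) | (MP_BIT if more_packets else 0)
    return HEADER.pack(src_node, dst_node, src_port, dst_port,
                       offset_word, length, seq, header_len, protocol)

def unpack_header(raw):
    """Decode the first 16 bytes of a packet into a dictionary of fields."""
    (src_node, dst_node, src_port, dst_port,
     offset_word, length, seq, header_len, protocol) = HEADER.unpack(raw[:HEADER.size])
    return {
        "src_node": src_node, "dst_node": dst_node,
        "src_port": src_port, "dst_port": dst_port,
        "offset": offset_word & OFFSET_MASK,
        "more_packets": bool(offset_word & MP_BIT),
        "length": length, "seq": seq,
        "header_len": header_len, "protocol": protocol,
    }
```

Keeping the MP flag inside the offset word mirrors the description above, where the offset field carries both the 15 bit position and the one bit indication.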

4.4.2 Packetisation of Transport Layer Fragments

The width of the offset field of the protocol header is 15 bits, which imposes a limit on
the transport layer fragment size of (32 + 1.5) Kbytes (32 KB due to the size of the offset
field in the packet header and an extra 1.5 KB that can be accommodated in the last
packet). However, this would lead to a 136 byte packet for the last one in a fragment.
To avoid this, the maximum size of a fragment has been reduced by 136 bytes to 34132
bytes. This makes a total of at most 23 packets in a fragment. All these calculations
have been done assuming that the device MTU is 1500 bytes and the header length is
16 bytes.
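The arithmetic behind these figures can be reproduced as follows. The sketch assumes that the "extra 1.5 KB" is exactly 1500 bytes, which is the reading that yields the 136 byte remainder quoted above; the variable names are illustrative only.

```python
MTU = 1500          # device MTU in bytes (from the text)
HEADER_LEN = 16     # protocol header length in bytes (from the text)
OFFSET_BITS = 15    # width of the offset field

payload = MTU - HEADER_LEN                  # data bytes in a full packet: 1484
raw_limit = 2 ** OFFSET_BITS + MTU          # 32768 + 1500 = 34268 bytes (assumed reading)
full_packets, leftover = divmod(raw_limit, payload)
max_fragment = full_packets * payload       # drop the short trailing packet

# The last packet's offset must still fit in the 15 bit offset field.
last_offset = (full_packets - 1) * payload
```

Here `full_packets` is 23, `leftover` is 136 and `max_fragment` is 34132, and the offset of the last packet (32648) is below 32768, so it fits in 15 bits.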

The Fragmentation layer requests the Packetisation layer to send a fragment of data.
If the size of the fragment is larger than the device MTU, the Packetisation layer breaks
it up into smaller packets. The initial packets are made of size equal to one header length
less than the device MTU. The residual bytes are adjusted in the last packet.
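The splitting rule can be sketched as below. The function name and the tuple representation are hypothetical; the sketch returns the packets in offset order and does not model the order in which they are put on the wire.

```python
def packetise(fragment: bytes, mtu: int = 1500, header_len: int = 16):
    """Split a fragment into (offset, more_packets, payload) tuples in offset order."""
    payload_size = mtu - header_len        # every packet but the last carries this much
    packets = []
    for offset in range(0, len(fragment), payload_size):
        chunk = fragment[offset:offset + payload_size]
        more = offset + len(chunk) < len(fragment)   # MP is clear only on the last packet
        packets.append((offset, more, chunk))
    return packets
```

For a 3000 byte fragment this gives two packets with 1484 data bytes each and a final packet with the 32 residual bytes, and a fragment of the maximum size of 34132 bytes gives exactly 23 packets.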

After the fragment is handed over to the Packetisation layer, it packetises the fragment
and sends these packets in the reverse order, i.e., it will transmit the last packet first and
the first packet corresponding to that fragment last. The advantage of this strategy is
that the receiver can know the size of the entire fragment at the arrival of the very first
packet belonging to that particular fragment. The packets, when floated on the network,
are received and handed over to the protocol as explained in § 3.3.2. The Packetisation
layer at the receiver composes the segment, as will be explained in § 4.4.3, and hands it
to the fragmentation module.


4.4.3 The Reassembly of a Fragment

At the receiver there is a queue for each fragment that is being received. Note that
corresponding to a connection there is only one fragment being received at any time. These
queues are linked into a global list. A queue for a fragment is created when the first
packet of the fragment arrives. Whenever a packet arrives, the queue for that fragment
(identified by the protocol header) is searched in the list. If it is found, the incoming
sk_buff is added to the queue; if it is not found, a queue is created and the sk_buff is
added to that.

When a packet with offset 0 arrives, indicating that all the packets of the fragment have
been transmitted, a reassembly of the fragment is attempted. If all the packets are
available in the queue, the fragment is composed and given to the upper layer. If there
are packets missing, a retransmission request is sent for all packets which are missing.
The details are explained in § 4.4.4. If packets are delayed or experience MAC layer
errors, timeouts occur to maintain the correctness of the protocol. The use of these
timers is explained in § 4.4.5 and § 4.5.3.
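The queueing and reassembly logic can be sketched in user space as follows. A dictionary stands in for the global list of queues and byte strings stand in for sk_buffs; the key, the function name and the return convention are assumptions made for this example. As a simplification, reassembly is attempted only when the packet with offset 0 arrives, and timers are left out.

```python
queues = {}   # connection key -> per-fragment queue (stand-in for the global list)

def receive_packet(key, offset, more_packets, payload):
    """Queue one packet; return the complete fragment, or None if it is not ready."""
    queue = queues.setdefault(key, {"packets": {}, "size": None})
    queue["packets"][offset] = payload
    if not more_packets:                   # last packet of the fragment (it is sent first)
        queue["size"] = offset + len(payload)
    if offset != 0:                        # only offset 0 triggers a reassembly attempt
        return None
    data = bytearray()
    for off in sorted(queue["packets"]):
        if off != len(data):               # a hole: a retransmission request would follow
            return None
        data += queue["packets"][off]
    if len(data) != queue["size"]:         # the tail of the fragment is still missing
        return None
    del queues[key]                        # fragment complete, release its queue
    return bytes(data)
```

Feeding the packets of a fragment in reverse order returns `None` until the packet with offset 0 arrives, at which point the whole fragment is returned if nothing was lost.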

Figure 4.4: The structure of a retransmission request message




4.4.4 The Retransmission Request

If the reassembly fails, the receiver sends a retransmission request to the sender. The
requests are of the form shown in Figure 4.4. For every hole in the reassembled fragment,
the starting point of the hole and the length of the hole are sent. The offset is found from
the end of the previous packet which was received, and the length is found from the offset
of the next packet which is received. A retransmission request is identified by a different
protocol field in the header. These requests will be served by the Fragmentation layer of
the sender (§ 4.5.2), and this violates the tenets of layered communication. This is done
because the information about the loss is first available at the Packetisation layer and
there is no point in climbing up the stack to send the requests and unnecessarily delaying
the procedure. Also, the packets are available only with the Fragmentation layer and
there is no point in having the Packetisation layer asking for them.
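The computation of the holes can be sketched as below. The function name and the dictionary representation of the queue are hypothetical, and the sketch assumes that the total fragment size is already known to the receiver.

```python
def find_holes(received, fragment_size):
    """received maps packet offset -> payload length; returns [(hole_offset, hole_length)]."""
    holes, expected = [], 0
    for offset in sorted(received):
        if offset > expected:                      # gap between previous end and this packet
            holes.append((expected, offset - expected))
        expected = max(expected, offset + received[offset])
    if expected < fragment_size:                   # the tail of the fragment is missing
        holes.append((expected, fragment_size - expected))
    return holes
```

An empty list corresponds to the empty request that acknowledges the fragment.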

After the transmitter receives the requests for retransmission, it will resend those
packets whose retransmission is requested. When these retransmitted packets arrive,
they are added into the queue as normal packets and assembly is attempted again.
The above procedure cannot be repeated indefinitely. There is a counter maintained for
each queue which keeps track of the number of retransmission requests sent for that
queue. Once this number exceeds a predefined threshold, steps are taken to close the
connection. To do this, the process that has requested the receive, i.e., the top half of
the network module of Linux, has to be aborted. For this purpose a packet to abort this
process is composed and is put in the receive queue of the sock for that connection.
When the top half sees this packet in its queue, it will just return without doing anything.

4.4.5 The Receive Timer

Situations might arise when the transmitter may fail and the receiver is waiting to receive
packets and fragments. To avoid waiting indefinitely, a receive timer has to be maintained.
Since most of the input processing is done at the Packetisation layer, it is convenient
to maintain the timer in this layer. This timer has three different timeouts for three
different states. The first state corresponds to the situation when the receiver is
receiving a fragment and has not received it completely. Since in this transaction there is
no handshake between the transmitter and the receiver, and the transmitter just pumps
all packets in a stream, this state has the lowest timeout value. In our implementation
we set this timeout to be one second. On expiry of this timer, the timer function
attempts a reassembly of the fragment being received. Since the fragment is incomplete, a
retransmission request is generated and transmitted as explained in § 4.4.4.

Once the whole fragment is reassembled (successfully or otherwise), an indication
has to be sent to the transmitter, and the next fragment will be obtained when this
indication is received at that end. Therefore this timeout should be at least double that
of the previous one. On expiry of this, a reassembly is attempted.

Once we receive the final fragment corresponding to a message, a large timeout is
assigned to this timer, which then serves as a keepalive timer so that a connection which has
been inactive for a long time can be closed. On expiry of this timer the connection is
closed.

From the above it is evident that at the receiving side having the timer in the
Packetisation layer is better, because it is here that the maximum amount of information
about the status of a connection is available.
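The three states of the receive timer can be summarised in a small table. The state names, the action strings and the keepalive value are assumptions made for this sketch; the text fixes only the one second timeout, the "at least double" relation and the fact that the third timeout is large.

```python
from enum import Enum

class RxState(Enum):
    RECEIVING_FRAGMENT = 1   # fragment partly received
    AWAITING_NEXT = 2        # indication sent, waiting for the next fragment
    IDLE = 3                 # final fragment of a message received (keepalive)

FRAGMENT_TIMEOUT = 1.0       # seconds, as set in the implementation described above
TIMEOUTS = {
    RxState.RECEIVING_FRAGMENT: FRAGMENT_TIMEOUT,
    RxState.AWAITING_NEXT: 2 * FRAGMENT_TIMEOUT,   # at least double the first timeout
    RxState.IDLE: 600.0,                           # hypothetical value; the text only says "large"
}

def on_expiry(state):
    """Action taken when the receive timer fires in the given state."""
    if state is RxState.IDLE:
        return "close_connection"
    return "attempt_reassembly"
```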

4.5 The Fragmentation Layer

The reliability of data transaction is ensured by the Fragmentation layer. This layer
consists of three modules as shown in Figure 4.2. The skeleton is provided by the
Fragmentation Module, which is responsible for breaking up the user data into fragments of
32 Kbytes and handing them to the Packetisation layer for further processing. The
Connection Management Module establishes and closes a connection. The Acknowledgment
Management Module deals with the incoming acknowledgments. This division of tasks
makes the code clean and modular and reduces some of the processing overheads. These
modules are explained below.

4.5.1 The Fragmentation Module

This module receives application data and the destination IP addresses as its input. The
primary objective of this module is to ensure reliable delivery of this data to the receiver.
It uses stop and wait with selective nacks from the receiver at the end of a fragment
to ensure reliable delivery of data.

The application data is divided into fragments of 32 Kbytes and each fragment is
passed to the Packetisation layer, where it is further broken into packets and sent over
the network. The Packetisation layer receives the destination address, the length of the
fragment and a pointer to a callback routine to copy the data bytes from the application
area to the kernel area. After giving the fragment to the Packetisation layer, the process
blocks and waits for a retransmission request from the receiver. If it gets an empty
request, indicating successful receipt of a fragment, it proceeds to process the next
fragment. If the receiver asks for selective retransmission, only those packets are sent
which are being requested and not the whole fragment. At this point also there is a
violation of the basic layered structure. The aim of this violation is to improve the
efficiency of the transaction. The acknowledgments cannot be served by the Packetisation
layer because this would necessitate multiple copies of the application data.


4.5.2 Retransmissions

Retransmissions by the transmitter are based on the information provided by the negative
acknowledgments sent by the receiver. From the retransmission request the transmitter
obtains the offset and the length of the data to be sent. There can be multiple packets
in the fragment that need to be retransmitted.

Retransmissions are handled by the same routines as those used in first time
transmissions. The only difference is in the callback mentioned in § 4.5.1. In the callback,
if it is a retransmission, the offset field is changed to the offset of the hole indicated by the
retransmission request. This offset is obtained by adding the offset provided by the
retransmission request to the offset set by the Packetisation layer. The MP bit is also
modified accordingly. After completion of a retransmission, a counter that counts the
number of retransmissions is incremented. When this counter exceeds a threshold, if
negative acknowledgments still continue to be received, a serious network error is assumed and
the connection is closed.
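The offset adjustment for a retransmitted hole can be sketched as follows. The function name is hypothetical, and the rule used for the MP bit (clear only when the packet reaches the true end of the fragment) is an interpretation of "modified accordingly", not something the text states explicitly.

```python
def retransmit_headers(hole_offset, hole_length, fragment_size, payload_size=1484):
    """Return (offset, more_packets, length) for the packets that cover one hole."""
    headers = []
    for local in range(0, hole_length, payload_size):
        length = min(payload_size, hole_length - local)
        offset = hole_offset + local             # request offset + Packetisation layer offset
        more = offset + length < fragment_size   # assumed: MP clear only at the fragment end
        headers.append((offset, more, length))
    return headers
```

The Packetisation layer numbers the retransmitted data from zero, as it would for a fresh fragment, and the addition of `hole_offset` restores the position within the original fragment.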

4.5.3 The Transmit Timer

The transmit timer is placed in the Fragmentation layer because most of the processing
in the sending of data is done in this layer. Unlike the receive timer, it operates only
in one mode. This timer is used to know the health of the connectivity to the receiver.
If, after a fragment is transmitted, the acknowledgment is not received before timeout,
the whole segment is retransmitted. If a timeout occurs on the retransmission, a serious
network error is assumed and the connection is closed.

4.5.4 Acknowledgment Management Module

The acknowledgments are identified by the protocol field in the acknowledgment
packets. The protocol handler recognises incoming packets as acknowledgments and
queues them into a queue which is different from the one meant for the data packets
(obviously!). The routine that serves the acknowledgments will wait for a retransmission
request after the fragment is handed to the Packetisation layer. It pops the request from
the queue and takes further action as explained in § 4.5.2.

4.5.5 Connection Management Module

LightCommunicator, like TCP, is a connection oriented protocol. Before starting the
transaction, a connection has to be established. This connection has to be closed after
the transaction is complete.

The connection establishment is a two way handshake. This is sufficient because the
round trip delays are low. There is an interface from this module to the application, like
in the Fragmentation module.

This module is the implementation of the system calls listen(), accept(), connect()
and close(). The application, when it wants to transmit data, sends a connection request
to the intended receiver. This request is identified from the protocol field. After sending
the request, it waits for an acknowledgment of the request. On the receiver side, after the
reception of the connection request, an acknowledgment is transmitted and the connection
is declared to be established. When this connection acknowledgment is received at the
sender, the connection is fully established.

Similarly, closing of a connection is also a two way handshake and follows the same
procedure as that of opening of a connection. This is fairly reasonable because the
transaction is predominantly unidirectional. This type of one way close does not prohibit
a full duplex transaction but only prohibits the user from doing a receive() after a close().

4.5.6 The Other System Calls

Apart from the above modules, there are a number of system calls like dup(), bind(),
etc. that are implemented in the same way as in TCP/IP.

Chapter 5


Performance and Conclusions


5.1 Performance

We now present the results of delay measurements on LightCommunicator and TCP/IP.
Messages of different sizes were transacted over a lightly loaded network and the
end to end delays were measured. The following experimental setup was used. A simple
client server program sends a message 20 times using LightCommunicator and TCP/IP.
These messages are sent to the server, which finds the source address from the
LightCommunicator (or TCP/IP) header and echoes the message back to the source. The delay at
the source from the instant at which it is submitted to the kernel is measured for each
transaction. The mean and variance of these delay measurements are obtained. The
experiment is repeated for 20 different sizes of the message. The delay measurements
use the clock() function of Linux. This function provides the active time taken by the
process and its children. For our experiments this does not measure the total CPU
load of the protocol but rather gives us the times for which the process is not blocked.
It may be noted that TCP/IP generates a higher CPU load than LightCommunicator.
For example, TCP adds five timers per full duplex connection, putting a relatively larger
load on the scheduler, whereas our implementation of LightCommunicator uses only two
timers. Similarly, the bottom handler is also an independent process in itself. There are
many other small overheads which are not captured by this experimental setup. Thus
the results that we report from our experiments do not capture other advantages of
LightCommunicator over TCP/IP. The real test of the utility of LightCommunicator will
be to use it in a distributed computing message passing library like MPI.

The measurement results are shown in Figures 5.1 and 5.2. Figure 5.1 shows the
sample mean of the delay from the 20 transactions for each message size for both
LightCommunicator and TCP/IP, and Figure 5.2 shows the sample coefficient of variation
(ratio of standard deviation to mean) for both the protocols.
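The two statistics can be computed from a list of delay samples as sketched below. The function name is hypothetical, and the use of the sample standard deviation (with the n - 1 divisor) is an assumption; the text does not say which estimator was used.

```python
import statistics

def summarise(delays):
    """Return (sample mean, sample coefficient of variation) of the delay samples."""
    mean = statistics.mean(delays)
    cv = statistics.stdev(delays) / mean   # standard deviation relative to the mean
    return mean, cv
```

Because the coefficient of variation is normalised by the mean, it allows the spread of the two protocols to be compared even though their mean delays differ.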



Figure 5.1: The means of the delays for both protocols


The results show that the coefficient of variation of TCP/IP is considerably larger
than that of LightCommunicator. Considering this with the fact that the mean delay
in LightCommunicator is lower, the delay variance for TCP/IP is much higher. This
increased variance of TCP/IP can be attributed to the various paths that a message can
take through the protocol stack at the source and the destination. The difference in the
route of navigation along the code may be a reason for the larger variance.

Figure 5.2: The coefficient of variation of delays

If we fit a straight line to the sample mean graphs we get the following results:

TCP/IP: $\mathrm{mean}_{TCP/IP}(x) = 0.088x + 0.003$, where $x$ is the packet size in Kbytes
LightCommunicator: $\mathrm{mean}_{LC}(x) = 0.043x + 0.001$, where $x$ is the packet size in Kbytes

A linear fit is sufficient because an attempt to fit a quadratic curve results in a very
low coefficient of the second degree term. The slope of the line of least square fit of
the TCP/IP delay is about twice that of LightCommunicator (0.088 against 0.043). For
larger packet sizes the processing delays become predominant; thus LightCommunicator
performs considerably better than TCP/IP.
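The fitted lines can be evaluated directly to compare the two protocols at a given size. The function names are illustrative, and the delays are in whatever unit Figure 5.1 uses.

```python
def mean_tcp(x_kb):
    """Fitted mean delay for TCP/IP at a packet size of x_kb Kbytes."""
    return 0.088 * x_kb + 0.003

def mean_lc(x_kb):
    """Fitted mean delay for LightCommunicator at a packet size of x_kb Kbytes."""
    return 0.043 * x_kb + 0.001

slope_ratio = 0.088 / 0.043                    # ratio of the two fitted slopes
ratio_at_32k = mean_tcp(32) / mean_lc(32)      # delay ratio predicted for 32 Kbytes
```

Both ratios come out slightly above 2, which is consistent with the statement that the TCP/IP slope is about twice that of LightCommunicator.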

Another observation from our experiment was that 8.4 × 10^6 bytes were transacted in
two directions using both the protocols. No packets were lost in the transactions using
LightCommunicator. We cannot say anything about the losses when using TCP/IP.

5.2 Summary and Future Work

We have developed LightCommunicator, a lightweight communication protocol, and
implemented it in the Linux kernel. The first step was to point out those areas in TCP which
are pure overheads in friendly situations. Following that, we enumerated the desirable
characteristics of a lightweight communication protocol for distributed computation in
friendly environments, specified LightCommunicator and implemented it in the Linux
kernel. To the application layer LightCommunicator appears very similar to TCP/IP.
This means that we could modify the TCP/IP code to obtain the LightCommunicator
code. Because of the similarity of TCP/IP and LightCommunicator, the modifications
required in the application to use LightCommunicator for communication are minimal.
The only change that needs to be made is in creating a socket interface by calling the
socket() system call. Our experiments on the delay performance of LightCommunicator
vis-a-vis TCP/IP show that the transaction delays can be reduced by more than half by
using LightCommunicator instead of TCP/IP. This advantage improves as the volume of
data exchanged during a transaction increases.

Our implementation is not yet efficient and some inefficiencies can be removed. On
the sending side, data copying has been reduced to one per message, which is the best
that can be done, but on the receiving side we still have two instances of data copying
per message. This can be reduced further to one, which will reduce the overhead further.

There is also a possibility to implement forwarding. This would be helpful if we want
to use a topology other than a bus. This will enable us to float multiple packets
simultaneously on the multicomputer interconnects. For example, in a master-slave architecture
a star topology can prove to be very advantageous. Similarly, toroids and grids may also
be constructed. Operating system support is also available for this purpose since Linux
can support multiple Ethernet cards.

Development of net tools for the protocol can also be done. Currently all the
information (about address and mappings) is stored in a file. Tools may be written (similar
to ifconfig). This would make its use easier.